Non-asymptotic approach to varying coefficient model

Olga Klopp* and Marianna Pensky†

Abstract 

In the present paper we consider the varying coefficient model, which represents a useful tool for exploring dynamic patterns in many applications. Existing methods typically provide asymptotic evaluation of the precision of estimation procedures under the assumption that the number of observations tends to infinity. In practical applications, however, only a finite number of measurements are available. In the present paper we focus on a non-asymptotic approach to the problem. We propose a novel estimation procedure which is based on recent developments in matrix estimation. In particular, for our estimator, we obtain upper bounds for the mean squared and the pointwise estimation errors. The obtained oracle inequalities are non-asymptotic and hold for finite sample size.

1 Introduction 

In the present paper we consider the varying coefficient model, which represents a useful tool for exploring dynamic patterns in economics, epidemiology, ecology, etc. This model can be viewed as a natural extension of the classical linear regression model: it allows parameters that are constant in the regression model to evolve with certain characteristics of the system, such as time, or age in epidemiological studies.

The varying coefficient models were introduced by Cleveland, Grosse and Shyu [4] and Hastie and Tibshirani [7] and have been extensively studied in the past 15 years. Estimation procedures for the varying coefficient model are based, for example, on kernel-local polynomial smoothing (see, e.g., [28, 8, 5, 12]), polynomial splines (see, e.g., [9, 11, 10]) and smoothing splines (see, e.g., [7, 8, 3]). More recently, Wang et al. [27] proposed a new procedure based on a local rank estimator; Kai et al. [13] introduced a semi-parametric quantile regression procedure and studied an effective variable selection procedure; Lian [20] developed a penalization-based approach for both variable selection and constant coefficient identification in a consistent framework. For more detailed discussions of the existing methods and possible applications, we refer to the very interesting survey of Fan and Zhang [6].

*MODAL'X, University Paris Ouest Nanterre (kloppolga@math.cnrs.fr)

†Department of Mathematics, University of Central Florida (marianna.pensky@ucf.edu)

Existing methods typically provide asymptotic evaluation of the precision of estimation procedures under the assumption that the number of observations tends to infinity. In practical applications, however, only a finite number of measurements are available. In the present paper, we focus on a non-asymptotic approach to the problem. We propose a novel estimation procedure which is based on recent developments in matrix estimation, in particular, matrix completion. In the matrix completion problem, one observes a small set of entries of a matrix and needs to estimate the remaining entries using these data. A standard assumption that allows such completion to be successful is that the unknown matrix has low rank or approximately low rank. The matrix completion problem has attracted considerable attention in the past few years (see, e.g., [2, 14, 19, 23, 16]). The most popular methods for matrix completion are based on nuclear-norm minimization, which we adapt in the present paper.

1.1 Formulation of the problem 

Let $(W_i, t_i, Y_i)$, $i = 1, \ldots, n$, be sampled independently from the varying coefficient model

$$Y = W^T f(t) + \sigma\xi. \qquad (1)$$

Here, $W \in \mathbb{R}^p$ are random vectors of predictors, $f(\cdot) = (f_1(\cdot), \ldots, f_p(\cdot))^T$ is an unknown vector-valued function of regression coefficients and $t \in [0, 1]$ is a random variable independent of $W$. Let $\mu$ denote its distribution. The noise variable $\xi$ is independent of $W$ and $t$ and is such that $E(\xi) = 0$ and $E(\xi^2) = 1$; $\sigma > 0$ denotes the noise level.
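As an illustration of model (1), the following minimal sketch draws a synthetic sample. The coefficient functions, the uniform law for $t$ and the Gaussian predictors rescaled to the unit ball are assumptions made only for this example; they are not prescribed by the paper.

```python
import numpy as np

def simulate_varying_coefficient(n, p, sigma, rng):
    """Draw n observations (W_i, t_i, Y_i) from Y = W^T f(t) + sigma * xi."""
    # Hypothetical coefficient functions: one time-varying component, the rest constant.
    def f(t):
        out = np.ones((t.shape[0], p))          # constant components f_k(t) = 1
        out[:, 0] = np.sin(2.0 * np.pi * t)     # the single non-constant component
        return out

    t = rng.uniform(0.0, 1.0, size=n)           # t ~ mu, assumed uniform on [0, 1]
    W = rng.normal(size=(n, p))
    # Rescale rows so that ||W||_2 <= 1, the normalisation used later in the paper.
    W /= np.maximum(1.0, np.linalg.norm(W, axis=1, keepdims=True))
    xi = rng.normal(size=n)                     # E(xi) = 0, E(xi^2) = 1
    Y = np.sum(W * f(t), axis=1) + sigma * xi   # row-wise W_i^T f(t_i) plus noise
    return W, t, Y

rng = np.random.default_rng(0)
W, t, Y = simulate_varying_coefficient(n=500, p=4, sigma=0.1, rng=rng)
```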

The goal is to estimate the vector function $f(\cdot)$ on the basis of observations $(W_i, t_i, Y_i)$, $i = 1, \ldots, n$. Our estimation method is based on the approximation of the unknown functions $f_i(t)$ using a basis expansion. This approximation generates the coordinate matrix $A_0$. In the above model, some of the components of the vector function $f$ are constant. The larger the part of the constant regression coefficients, the smaller the rank of the coordinate matrix $A_0$ (the rank of matrix $A_0$ does not exceed the number of time-varying components of the vector $f(\cdot)$ by more than one). We suppose that the first element of this basis is just a constant function on $[0, 1]$ (indeed, this is true for the vast majority of bases on a finite interval). In this case, if the component $f_i(\cdot)$ is constant, then it has only one non-zero coefficient in its expansion over the basis. This suggests the idea of taking into account the number of constant regression coefficients using the rank of the coordinate matrix $A_0$.

Our procedure involves estimating $A_0$ using nuclear-norm penalization, which is now a well-established proxy for rank penalization in the compressed sensing literature. Subsequently, the estimator of the coordinate matrix is plugged into the expansion, yielding the estimator $\hat f(\cdot) = (\hat f_1(\cdot), \ldots, \hat f_p(\cdot))^T$ of the vector function $f(t)$. For this estimator we obtain upper bounds on the mean squared error $\frac{1}{p}\sum_{i=1}^{p}\|\hat f_i - f_i\|^2_{L_2(d\mu)}$ and on the pointwise estimation error $\frac{1}{p}\sum_{i=1}^{p}|\hat f_i(t) - f_i(t)|$ for any $t \in \mathrm{supp}(\mu)$ (Corollary 1). These oracle inequalities are non-asymptotic and hold for finite values of $p$ and $n$. The results in this paper concern random measurements and random noise and so they hold with high probability.

1.2 Layout of the paper 

The remainder of this paper is organized as follows. In Section 1.3 we introduce the notation used throughout the paper. In Section 2, we describe our estimation method in detail, give examples of possible choices of the basis (Section 2.1) and introduce an estimator for the coordinate matrix $A_0$ (Section 2.2). Section 3 presents the main results of the paper. In particular, Theorems 1 and 2 in Section 3 establish upper bounds for the estimation error of the coordinate matrix $A_0$ measured in Frobenius norm. Corollary 1 provides non-asymptotic upper bounds for the mean squared and pointwise risks of the estimator of the vector function $f$. Section 4 considers an important particular case of the orthogonal dictionary.

1.3 Notations 

We provide a brief summary of the notation used throughout this paper. Let $A, B$ be matrices in $\mathbb{R}^{p\times l}$, $\mu$ be a probability distribution on $(0, 1)$ and $\psi(\cdot)$ be a vector-valued function.

• For any vector $\eta \in \mathbb{R}^p$, we denote the standard $l_1$ and $l_2$ vector norms by $\|\eta\|_1$ and $\|\eta\|_2$, respectively.

• $\|\cdot\|_{L_2(d\mu)}$ and $\langle\cdot,\cdot\rangle_{L_2(d\mu)}$ are the norm and the scalar product in the space $L_2((0,1), d\mu)$.

• For $\psi(\cdot) = (\psi_1(\cdot), \ldots, \psi_p(\cdot))^T$, we set $\|\psi(\cdot)\|_\infty = \max_{i=1,\ldots,p}\ \sup_{t\in\mathrm{supp}(\mu)} |\psi_i(t)|$ and $\|\psi(\cdot)\|_{L_2(d\mu)} = \max_{i=1,\ldots,p} \|\psi_i\|_{L_2(d\mu)}$.

• We define the scalar product of matrices $\langle A, B\rangle = \mathrm{tr}(A^T B)$, where $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix.

• Let

$$\|A\|_* = \sum_{j=1}^{\min(p,l)} \sigma_j(A) \quad\text{and}\quad \|A\|_2 = \Big(\sum_{j=1}^{\min(p,l)} \sigma_j^2(A)\Big)^{1/2}$$

be respectively the trace and Frobenius norms of the matrix $A$. Here $(\sigma_j(A))_j$ are the singular values of $A$ ordered decreasingly.

• Let $\|A\| = \sigma_1(A)$.

• For any numbers $a$ and $b$, denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$.

• Denote the $k \times k$ identity matrix by $I_k$.

• Let $(s - 1)$ denote the number of non-constant $f_i(\cdot)$.

• In what follows, we use the symbol $C$ for a generic positive constant, which is independent of $n$, $p$, $s$ and $l$, and may take different values at different places.
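The matrix norms listed above can all be read off the singular values. The short sketch below (NumPy, with arbitrary random matrices chosen only for illustration) computes them and checks the duality inequality $|\langle A, B\rangle| \le \|A\|\,\|B\|_*$ that is used repeatedly in the proofs.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(5, 3))

sing = np.linalg.svd(A, compute_uv=False)     # singular values in decreasing order
trace_norm = sing.sum()                       # ||A||_*  (trace / nuclear norm)
frobenius_norm = np.sqrt((sing ** 2).sum())   # ||A||_2  (Frobenius norm)
operator_norm = sing[0]                       # ||A||    (largest singular value)
inner = np.trace(A.T @ B)                     # <A, B> = tr(A^T B)

# Duality between the operator and trace norms: |<A, B>| <= ||A|| * ||B||_*.
duality_holds = abs(inner) <= operator_norm * np.linalg.norm(B, 'nuc')
```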



2 Estimation method 

The first step of our estimation method is the approximation of the unknown functions $f_i(t)$ by expanding them over an appropriate basis. This approximation generates the coordinate matrix $A_0$. Matrix $A_0$ is estimated using penalized risk minimization. The estimator of the coordinate matrix is plugged into the expansion, yielding the estimator of the vector function $f$.

2.1 Basis expansion 

Let $(\phi_i(\cdot))_{i=1,\ldots,\infty}$ be an orthonormal basis in $L_2((0, 1), d\mu)$, $l \in \mathbb{N}$ and $\phi(\cdot) = (\phi_1(\cdot), \ldots, \phi_l(\cdot))^T$. We assume that the basis functions satisfy the following condition: there exists $c_\phi < \infty$ such that

$$\|\phi(t)\|_2^2 = \sum_{j=1}^{l} |\phi_j(t)|^2 \le c_\phi^2\, l, \qquad (2)$$

for any $l \ge 1$ and any $t \in [0,1]$. Note that this condition is satisfied for most of the usual bases.
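Condition (2) is easy to check numerically for a concrete basis. The sketch below uses the cosine basis $\phi_1 = 1$, $\phi_j(t) = \sqrt{2}\cos(\pi (j-1) t)$ with $\mu$ uniform on $[0,1]$; this particular basis is an assumption chosen for illustration and is not one of the paper's examples. For it, $\sum_{j \le l} \phi_j(t)^2 \le 2l - 1$, so (2) holds with $c_\phi^2 = 2$.

```python
import numpy as np

def cosine_basis(t, l):
    """Evaluate phi_1, ..., phi_l at the points t; returns an array of shape (len(t), l)."""
    t = np.asarray(t, dtype=float)
    freq = np.arange(l)                                  # frequencies 0, 1, ..., l-1
    phi = np.sqrt(2.0) * np.cos(np.pi * t[:, None] * freq[None, :])
    phi[:, 0] = 1.0                                      # first basis function is constant
    return phi

l = 8
N = 2000
midpoints = (np.arange(N) + 0.5) / N                     # midpoint rule on [0, 1]
Phi = cosine_basis(midpoints, l)
gram = Phi.T @ Phi / N                                   # approximates E[phi(t) phi(t)^T]

# Squared Euclidean norm of phi(t) on a grid, to compare with c_phi^2 * l = 2 * l.
sq_norm = (cosine_basis(np.linspace(0.0, 1.0, 501), l) ** 2).sum(axis=1)
```

The Gram matrix being close to the identity reflects orthonormality in $L_2((0,1), d\mu)$, and the largest value of `sq_norm` is attained at the endpoints of the interval.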

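We introduce the coordinate matrix $A_0 \in \mathbb{R}^{p\times l}$ with elements

$$a^0_{kj} = \langle f_k, \phi_j\rangle_{L_2(d\mu)}, \qquad k = 1, \ldots, p, \quad j = 1, \ldots, l.$$

For each $k = 1, \ldots, p$, we have

$$f_k(t) = \sum_{j=1}^{l} a^0_{kj}\,\phi_j(t) + \rho_k^{(l)}(t). \qquad (3)$$

Denote the remainder by $\rho^{(l)}(\cdot) = (\rho_1^{(l)}(\cdot), \ldots, \rho_p^{(l)}(\cdot))^T$. We assume that the basis $(\phi_i(\cdot))_{i=1,\ldots,\infty}$ guarantees good approximation of $f_k$ by $\sum_{j=1}^{l} a^0_{kj}\phi_j(t)$, that is: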



Assumption 1. We assume that the basis satisfies condition (2) and that there exists a positive constant $b$ such that, for any $l \ge 1$,

$$\|\rho^{(l)}(\cdot)\|_\infty \le b\, l^{-\gamma}, \qquad \gamma > 0. \qquad (4)$$

Often approximation in the $L_2$-norm gives better rates of convergence. In order to get upper bounds on the mean squared error we will use the following additional assumption:

Assumption 2. There exists $b_1 > 0$ such that, for any $l \ge 1$,

$$\|\rho^{(l)}(\cdot)\|_{L_2(d\mu)} \le b_1\, l^{-(\gamma+1/2)}, \qquad \gamma > 0.$$

Let us give a few examples of possible choices of the basis.

Example 1. Assume that $d\mu = g(t)\,dt$ and that the function $g$ is bounded away from zero and infinity, i.e. there exist absolute constants $g_1$ and $g_2$ such that for any $t \in \mathrm{supp}(\mu)$

$$g_1 \le g(t) \le g_2, \qquad 0 < g_1 < g_2 < \infty. \qquad (5)$$

Denote by $\psi_j(t) = e^{2i\pi jt}$, $j \in \mathbb{Z}$, the standard Fourier basis of $L_2((0, 1))$. Then, it is easy to check that $\phi_j(t) = \psi_j(t)/\sqrt{g(t)}$, $j \in \mathbb{Z}$, is an orthonormal basis of $L_2((0, 1), g)$. Moreover, condition (2) holds with $c_\phi^2 = g_1^{-1}$.

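For $\gamma > 0$, consider the Sobolev space $W^\gamma(0, 1)$ of functions $F \in L_2(0, 1)$ with the norm $\|F\|^2_{W^\gamma} = \int |\omega|^{2\gamma+1}\,|\hat F(\omega)|^2\,d\omega$, where $\hat F(\omega)$ is the Fourier transform of $F$. Then, by Theorems 9.1 and 9.2 of [22], the Fourier coefficients $\langle F, \psi_j\rangle$, with $\psi_j(t) = e^{2i\pi jt}$, satisfy

$$\sum_{j=-\infty}^{\infty} |j|^{2\gamma+1}\,|\langle F, \psi_j\rangle|^2 \le C_\gamma\,\|F\|^2_{W^\gamma}, \qquad (6)$$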



where C 7 is an absolute constant which depends on 7 only. Assume that for 
some A < 00 the functions fa belong to a Sobolev ball of radius A, i.e. 



max 



< A, 7 > 0. 



(7) 



Let I = 2N+ 1, so that 



JV 



/*(*)= E <4°(*) = E o «^(*)' 

i=-A r |i|>JV 

where = (fk(t)\/ g(t), <f>j(t))- Then, it follows from equations (5), (6) and 
(7) that 



P (l H-) 



< 9? 



9i 



E 

\j\>N 

E W" 27 - 1 < 

\j\>N 



max E \J\ 21+1 K 



\j\>N 



A Cry 



where N = (I - l)/2 and 



£3 GO 

so that Assumptions 1 and 2 hold. 
Example 2. Consider a wavelet $\psi$ with a bounded support of length $C_\psi$ and with $\gamma^*$ vanishing moments and choose $l = 2^H$ where $H$ is a positive integer. Construct a periodic wavelet basis $\psi_{h,i}(t)$, $h = -1, \ldots, H-1$, $i = 0, \ldots, 2^h - 1$, with $\psi_{-1,0}(t) = 1$ and $\psi_{h,i}(t) = 2^{h/2}\psi(2^h t - i)$ for $h \ge 0$. As in Example 1, set $\phi_j(t) = \phi_{h,i}(t) = \psi_{h,i}(t)/\sqrt{g(t)}$ where $j = 2^h + i + 1$. Note that condition (2) holds in this case with $c_\phi^2 = g_1^{-1} C_\psi \|\psi\|_\infty^2$.

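Then, each function $f_k(t)$ can be expanded into a wavelet series

$$f_k(t) = \sum_{h=-1}^{H-1}\sum_{i=0}^{2^h-1} a^0_{k,h,i}\,\phi_{h,i}(t) + \rho_k^{(l)}(t), \qquad \rho_k^{(l)}(t) = \sum_{h=H}^{\infty}\sum_{i=0}^{2^h-1} a^0_{k,h,i}\,\phi_{h,i}(t),$$

where $a^0_{k,h,i} = \langle f_k(\cdot)\sqrt{g(\cdot)}, \psi_{h,i}(\cdot)\rangle$.

Theorem 9.4 of [22] states that for $F \in W^\gamma(0, 1)$ one has

$$\sum_{h=-1}^{\infty} 2^{h(2\gamma+1)} \sum_{i=0}^{2^h-1} |\langle F, \psi_{h,i}\rangle|^2 \le C_\gamma\,\|F\|^2_{W^\gamma},$$

where $C_\gamma$ is an absolute constant which depends on $\gamma$ only, provided $\gamma < \gamma^*$. Then, under assumptions (5) and (7), as in Example 1, Assumption 1 holds. For example, recalling that $H = \log_2 l$ and that the length of the support of $\psi$ is bounded by $C_\psi$, we obtain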

2 00 2 h -l 

<2-»<*+* max E 2W) El<Ml 2 <^- (2 ^ +1) , 

(s) k=l,-,p f-~L J—f 



2"-l 

fi.il 

P 



< (27.91)-^ II^HL 2- 2 ^ max 2 2 W £ |a° fe 

fe=-l i=0 

< A 2 (2 igi y 1 c^ ml r 2 \ 

where $\|\psi\|_\infty = \sup_t |\psi(t)|$.

Example 3. Suppose that the functions $f_i(t)$ belong to a finite $k$-dimensional subspace of $L_2((0, 1), d\mu)$. For example, the $f_i$ are polynomials of degree less than $k$. Then, choosing $l = k$ and an orthonormal basis in this subspace, we have trivially $\rho^{(l)}(\cdot) = 0$.
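Example 3 can be illustrated with shifted Legendre polynomials, which form an orthonormal basis of polynomials on $[0,1]$ when $\mu$ is uniform. This choice of $\mu$, and the cubic test function below, are assumptions made only for the sketch. With $l = k$ the remainder vanishes up to rounding error.

```python
import numpy as np
from numpy.polynomial import legendre

def shifted_legendre_basis(t, k):
    """Orthonormal polynomials of degree < k on [0, 1] w.r.t. the uniform measure."""
    t = np.asarray(t, dtype=float)
    cols = []
    for j in range(k):
        coef = np.zeros(j + 1)
        coef[j] = 1.0                                    # selects the Legendre polynomial P_j
        cols.append(np.sqrt(2 * j + 1) * legendre.legval(2.0 * t - 1.0, coef))
    return np.stack(cols, axis=1)

k = 4
nodes, weights = legendre.leggauss(20)                   # Gauss-Legendre rule on [-1, 1]
t_nodes, w_nodes = (nodes + 1.0) / 2.0, weights / 2.0    # rule mapped to [0, 1]

def f(t):
    return 1.0 - 3.0 * t + 2.0 * t ** 3                  # hypothetical cubic, degree < k

Phi = shifted_legendre_basis(t_nodes, k)
coeffs = Phi.T @ (w_nodes * f(t_nodes))                  # a_j = <f, phi_j>, by quadrature
grid = np.linspace(0.0, 1.0, 101)
remainder = f(grid) - shifted_legendre_basis(grid, k) @ coeffs
```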

2.2 Estimation of the coordinate matrix 

Denoting $X = W\phi^T(t)$, we can rewrite (1) in the following form

$$Y = \mathrm{tr}(A_0 X^T) + W^T\rho^{(l)}(t) + \sigma\xi. \qquad (8)$$

We suppose that some of the functions $f_i(\cdot)$ are constant and let $(s - 1)$ denote the number of non-constant $f_i(\cdot)$. This parameter, $s$, plays an important role in what follows. Note that $\mathrm{rank}(A_0) \le s$.
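The rank bound can be seen directly from the structure of $A_0$ when the first basis function is constant: every constant $f_k$ contributes a row supported on the first column only. The sketch below builds such a matrix with arbitrary (hypothetical) coefficient values and checks the bound numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
p, l, s = 10, 6, 3                       # s - 1 = 2 non-constant coefficient functions

A0 = np.zeros((p, l))
A0[:, 0] = rng.normal(size=p)            # every f_k may load on the constant phi_1
# Only the s - 1 non-constant functions have coefficients on phi_2, ..., phi_l.
A0[: s - 1, 1:] = rng.normal(size=(s - 1, l - 1))

rank = np.linalg.matrix_rank(A0)         # at most (s - 1) + 1 = s
```

The rows of the constant components all lie in the span of the first coordinate vector, which adds at most one dimension to the row space spanned by the $s-1$ non-constant rows.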
Using observations $(Y_i, X_i)$ we define the following estimator of $A_0$:

$$\hat A = \arg\min_{A \in \mathbb{R}^{p\times l}} \left\{ \frac{1}{n}\sum_{i=1}^{n} (Y_i - \langle X_i, A\rangle)^2 + \lambda\,\|A\|_* \right\}, \qquad (9)$$

where $\lambda$ is the regularization parameter. This penalization, using the trace norm, is now quite standard in the matrix completion problem and allows one to recover a matrix from under-sampled measurements.

Using the estimator (9) of the coordinate matrix $A_0$, we recover $f(t)$ as

$$\hat f(t) = \hat A\,\phi(t).$$
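The paper does not prescribe a numerical solver for (9). One common option for nuclear-norm penalized least squares is proximal gradient descent with singular value soft-thresholding; the sketch below takes that route as an assumption. The function names, the cosine basis, the synthetic data and the value of `lam` are all hypothetical choices for illustration.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal map of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(A, W, Phi, Y, lam):
    """Criterion (9): mean squared residual plus lam times the trace norm."""
    resid = Y - np.einsum('ia,ab,ib->i', W, A, Phi)      # <X_i, A> = W_i^T A phi(t_i)
    return np.mean(resid ** 2) + lam * np.linalg.norm(A, 'nuc')

def fit_coordinate_matrix(W, Phi, Y, lam, n_iter=300):
    """Proximal gradient iterations for (9), started from the zero matrix."""
    n, p = W.shape
    l = Phi.shape[1]
    design = np.einsum('ia,ib->iab', W, Phi).reshape(n, p * l)   # rows are vec(X_i)
    L = 2.0 * np.linalg.norm(design, 2) ** 2 / n         # Lipschitz constant of the gradient
    A = np.zeros((p, l))
    for _ in range(n_iter):
        resid = Y - np.einsum('ia,ab,ib->i', W, A, Phi)
        grad = -(2.0 / n) * (W * resid[:, None]).T @ Phi # gradient of the quadratic term
        A = svt(A - grad / L, lam / L)
    return A

# Synthetic data (assumed): cosine basis, uniform t, predictors rescaled to the unit ball.
rng = np.random.default_rng(3)
n, p, l = 400, 4, 5
t = rng.uniform(size=n)
Phi = np.sqrt(2.0) * np.cos(np.pi * t[:, None] * np.arange(l)[None, :])
Phi[:, 0] = 1.0
W = rng.normal(size=(n, p))
W /= np.maximum(1.0, np.linalg.norm(W, axis=1, keepdims=True))
A_true = np.zeros((p, l))
A_true[:, 0] = 1.0                                       # all components have a constant part
A_true[0, 1] = 0.5                                       # one component varies with t
Y = np.einsum('ia,ab,ib->i', W, A_true, Phi) + 0.05 * rng.normal(size=n)

A_hat = fit_coordinate_matrix(W, Phi, Y, lam=0.01)
f_hat_at_t = Phi @ A_hat.T                               # row i holds f_hat(t_i) = A_hat phi(t_i)
```

With step size $1/L$ the criterion is non-increasing along the iterations, which gives a simple sanity check; the theoretical choice of $\lambda$ is discussed in Section 3.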

2.3 Assumptions about the dictionary and the noise 

We assume that the vectors $W_i$ are i.i.d. copies of a random vector $W$ having distribution $\Pi$ on a given set of vectors $\mathcal{X}$. Using rescaling, we can suppose that $\|W\|_2 \le 1$ almost surely. Let $E(WW^T) = \Omega$ and let $\omega_{\max}$, $\omega_{\min}$ denote respectively its maximal and minimal singular values. We need the following assumption on the distribution of $W$.

Assumption 3. The matrix $\Omega = E(WW^T)$ is positive definite.

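Let $\|A\|^2_{L_2(\Pi\otimes\mu)} = E(\langle X, A\rangle^2)$. An easy computation leads to

$$\|A\|^2_{L_2(\Pi\otimes\mu)} = E(\langle W, A\phi(t)\rangle^2) = E_t\big(E_W(\langle W, A\phi(t)\rangle^2)\big)$$

and

$$E_W(\langle W, A\phi(t)\rangle^2) = E_W\big(\mathrm{tr}((A\phi(t))^T WW^T A\phi(t))\big) = E_W\big(\mathrm{tr}(WW^T A\phi(t)(A\phi(t))^T)\big) = \langle E_W(WW^T), A\phi(t)(A\phi(t))^T\rangle = \langle \Omega, A\phi(t)(A\phi(t))^T\rangle.$$

By definition we obtain

$$\langle \Omega, A\phi(t)(A\phi(t))^T\rangle \ge \omega_{\min}\,\|A\phi(t)\|_2^2.$$

Finally we compute

$$\|A\|^2_{L_2(\Pi\otimes\mu)} \ge \omega_{\min}\,E_t(\|A\phi(t)\|_2^2) = \omega_{\min}\,\|A\|_2^2, \qquad (10)$$

where in the last display we used that $(\phi_j(\cdot))_{j=1,\ldots,\infty}$ is an orthonormal basis in $L_2((0,1), d\mu)$.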
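Inequality (10) is a purely algebraic consequence of $E_t(\phi(t)\phi^T(t)) = I_l$, which gives $\|A\|^2_{L_2(\Pi\otimes\mu)} = \langle\Omega, AA^T\rangle = \mathrm{tr}(A^T\Omega A)$. The sketch below checks it for a randomly generated positive definite $\Omega$; the matrix is hypothetical and only normalised so that $\mathrm{tr}(\Omega) = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
p, l = 4, 6
G = rng.normal(size=(p, p))
Omega = G @ G.T / p + 0.1 * np.eye(p)        # hypothetical positive definite E(W W^T)
Omega /= np.trace(Omega)                     # normalise so that tr(Omega) = 1
A = rng.normal(size=(p, l))

# With E_t[phi(t) phi(t)^T] = I_l, the L2(Pi x mu) norm reduces to tr(A^T Omega A).
l2_norm_sq = np.trace(A.T @ Omega @ A)
omega_min = np.linalg.eigvalsh(Omega)[0]     # smallest eigenvalue of the symmetric Omega
frob_sq = np.sum(A ** 2)                     # ||A||_2^2 (squared Frobenius norm)
lower_bound = omega_min * frob_sq            # right-hand side of (10)
```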

We consider the case of sub-exponential noise which satisfies the following condition.

Assumption 4. There exists a constant $K > 0$ such that

$$\max_{i=1,\ldots,n} E\exp(|\xi_i|/K) \le e.$$

For instance, if the $\xi_i$ are i.i.d. standard Gaussian we can take $K = 1$.



3 Main Results 

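Let

$$\Sigma_R = \frac{1}{n}\sum_{i=1}^{n} \epsilon_i X_i \quad\text{and}\quad \Sigma = \frac{1}{n}\sum_{i=1}^{n} \big(W_i^T\rho^{(l)}(t_i) + \sigma\xi_i\big)\,X_i,$$

where $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. Rademacher sequence. These stochastic terms play an important role in the choice of the regularization parameter $\lambda$.
We introduce the following notations:

$$M = \mathrm{tr}(\Omega) \vee (l\,\omega_{\max}) \quad\text{and}\quad n^{**} = \frac{C c_\phi^2\, l\,\log(d)}{\omega_{\min}^2}\,[(Ms) \vee 1].$$

The following theorem gives a general upper bound on the prediction error for the estimator $\hat A$ given by (9). Its proof is given in Appendix A.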

Theorem 1. Let $\lambda \ge 3\,\|\Sigma\|$ and suppose that Assumption 3 holds. Then, with probability at least $1 - 2/d$,



A-A 



< C 



A 2 + Po| 



(ii) If, in addition n > n** , then 



A-A 



< 



CsX 2 



where d = I + p. 

In order to obtain the upper bounds in Theorem 1 in a closed form, it is necessary to obtain a suitable upper bound for $\|\Sigma\|$. The following lemma, proved in Section E, gives such a bound.

Lemma 1. Under Assumptions 1 - 4, there exists a numerical constant $c^*$, that depends only on $K$, such that, for all $t > 0$, with probability at least $1 - 2e^{-t}$

2by/s~ 1' 



E < ac 



M(i + log(d)) 



dp Vl (t + log(rf)) ( K log 



Kc<j, 



V 1 



(11) 



where d = p + I. 



The optimal choice of the parameter $t$ in Lemma 1 is $t = \log(d)$. A larger $t$ leads to a slower rate of convergence, and a smaller $t$ does not improve the rate but makes the concentration probability smaller. With this choice of $t$, the second term in the maximum in (11) is negligibly small for $n \ge n^*$ where



2c 2 l 



K log 



VI log(d) 



M 

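In order to satisfy the condition $\lambda \ge 3\,\|\Sigma\|$ in Theorem 1 we can choose

$$\lambda = 4.25\left(c^*\sigma + \frac{2b\sqrt{s-1}}{l^\gamma}\right)\sqrt{\frac{M\log(d)}{n}}. \qquad (12)$$

If the $\xi_i$ are $N(0, 1)$, then we can take $c^* = 6.5$ (see Lemma 4 in [15]).
With these choices of $\lambda$, we obtain the following theorem.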

Theorem 2. Let Assumptions 1-4 hold. Consider regularization parameters $\lambda$ satisfying (12) and $n \ge n^*$. Then, with probability greater than $1 - 4/d$

(i) 



A-A 



< C 



max < a 



b 2 (s - 1) 



2 \ Mslog(d) c Poll: /log(d)Z 



(ii) If, in addition n > n** , then 



A- A< 



<C [a 



b 2 (s-l)\ Ms log(d) 



t 2 t 



Using $\hat A$ we define the estimator of $f(t)$ as

$$\hat f(t) = (\hat f_1(t), \ldots, \hat f_p(t))^T = \hat A\,\phi(t). \qquad (13)$$

Theorem 2 yields the following upper bounds on the prediction error of $\hat f(t)$.

Corollary 1. Suppose that the assumptions of Theorem 2 hold. With probability greater than $1 - 4/d$, one has

(a) $\forall t \in \mathrm{supp}(\mu)$



1 p , f ^ , m C U(t)\\t (3 2b 2 s 
(b) If, in addition, Assumption 2 holds



wher 



b 2 {s- 1)\ Ms log(d) 



/2o 



z/ n > n* 



max < cr 



^ +* Poll* 



Ms log(d) c Po||, v/bgp)!^ 



\f not. 



Proof. We shall prove the second statement of the corollary; the first one can be proved in a similar way. Let $A^i$ denote the $i$-th row of a matrix $A$. We compute

$$\|f_i(\cdot) - \hat A^i\phi(\cdot)\|_{L_2(d\mu)} \le \|f_i(\cdot) - A_0^i\phi(\cdot)\|_{L_2(d\mu)} + \|(A_0^i - \hat A^i)\,\phi(\cdot)\|_{L_2(d\mu)} = \|\rho_i^{(l)}(\cdot)\|_{L_2(d\mu)} + \|A_0^i - \hat A^i\|_2, \qquad (14)$$

where in the last display we used that $(\phi_i(\cdot))_{i=1,\ldots,\infty}$ is an orthonormal basis. Using (14) and Assumption 2 we derive

$$\sum_{i=1}^{p} \|f_i - \hat f_i\|^2_{L_2(d\mu)} \le \frac{2 b_1^2\, s}{l^{2\gamma+1}} + 2\,\|A_0 - \hat A\|_2^2.$$

Now Theorem 2 implies the statement of the corollary. □



4 Orthonormal dictionary 

As an important particular case, let us consider the orthonormal dictionary. Let $(e_j)_j$ be the canonical basis of $\mathbb{R}^p$. Assume that the vectors $W_i$ are i.i.d. copies of a random vector $W$ which has the uniform distribution $\Pi$ on the set

$$\mathcal{X} = \{e_j,\ 1 \le j \le p\}.$$

Note that this is an unfavorable case of very "sparse observations", that is, each observation provides some information on only one of the coefficients of $f(t)$.
In this case, $\Omega = \frac{1}{p} I_p$, $\omega_{\max} = \omega_{\min} = \frac{1}{p}$ and we obtain the following values of parameters

$$M = \frac{l \vee p}{p},$$

-2 ,2 (v „\ 2 



2K* log 2 (Kp) log(d) (lAp), 




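$$\lambda = 4.25\left(c^*\sigma + \frac{2b\sqrt{s-1}}{l^\gamma}\right)\sqrt{\frac{(l \vee p)\log(d)}{p\,n}}, \qquad (15)$$

$$n^{**} = C c_\phi^2\, l\, s\, p\,(l \vee p)\log(d).$$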
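The values of $\Omega$, $\omega_{\min}$, $\omega_{\max}$ and $M$ for the orthonormal dictionary can be verified by direct computation. The helper below is a hypothetical illustration: it averages $e_j e_j^T$ over the canonical basis and evaluates $M = \mathrm{tr}(\Omega) \vee (l\,\omega_{\max})$.

```python
import numpy as np

def dictionary_parameters(p, l):
    """Omega, its extreme eigenvalues and M when W is uniform on the canonical basis of R^p."""
    E = np.eye(p)                                    # rows are e_1, ..., e_p
    Omega = sum(np.outer(e, e) for e in E) / p       # E(W W^T) under the uniform law
    eig = np.linalg.eigvalsh(Omega)                  # eigenvalues in ascending order
    omega_min, omega_max = eig[0], eig[-1]
    M = max(np.trace(Omega), l * omega_max)          # M = tr(Omega) v (l * omega_max)
    return Omega, omega_min, omega_max, M

Omega, omega_min, omega_max, M = dictionary_parameters(p=5, l=8)
```

For $p = 5$ and $l = 8$ this gives $M = (l \vee p)/p = 8/5$, in agreement with the value stated above.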
Plugging these values into Corollary 1, we derive the following result.

Corollary 2. Let Assumptions 1 and 4 hold. Consider a regularization parameter $\lambda$ satisfying (15), and $n \ge n^*$. Then, with probability greater than $1 - 4/d$, one has

(a) $\forall t \in \mathrm{supp}(\mu)$



I£ l/l(iW , (i)l <^» + ^i 

pi=i n pl z ~ 



(16) 



(b) If, in addition, Assumption 2 holds



1 v 2 
-,5JI/i - fih,(d M ) ^ — r 



C p 2b\s 



pi=l 



(17) 



wher 



r 2 , fr 2 



= 4 



& 2 ( s -l) 



(/ Vp) s log(d), if n>n* 



/2, 



+ Z Poll, (IV p) a log(d), s/not 



Remarks. Optimal choice of parameter $l$: The upper bounds given in Corollary 2 indicate the optimal choice of the parameter $l$. From (15) we compute the following values of $l$:



1* 



C&sp 2 log(d) 



if I < p 



and 



Cc\sp log(d) 



if / > p. 



Let 



Fi(i) = C a 2 + 



2 t 2 (s-l)\ ps log(d) 2 6 2 s 



/2 7 



n p/( 2 7+i) : 



F 2 (l) = F^l) + I \\Ac 



2 ps log(rf) 



F 3 (l) = C a 2 + 



& 2 (s- 1)\ Zs log(d) 2 6 2 s 



/2 7 



7i pi(*r+i)' 



F 4 (0 = F 3 (0 + Z 2 Pol 



2 s log(d) 
Let $\gamma > 1/2$ and consider first the case $sp^3\log(d) \gtrsim n \gtrsim sp^2\log(d)$ (the symbol $\lesssim$ means that the inequality holds up to a multiplicative numerical constant). Then, Corollary 2 implies that



1 p ~ 

B 1=1 



«llL 2 (d M ) 




if 1 < I < If 
if 11 < I < p 

if I > p. 



On [1, Z*], Fi(Z) achieves its minimum at Z*. Note that i*i(ZJ) < F 2 (Z) for any I G 
[Zf,p] and Fi(ZJ) < F 4 (Z) for any Z > p. Then, for sp 3 log(d) > n > sp 2 log(d) 
the optimal value of Z minimizing (17) is 



/i 



Gc\sp 2 log(d) 



When n> sp 3 log(cZ), the Corollary 2 implies that 



1 p 



SH/i-Zilli, 



Let 



I "i 



< 



Fi(Z), if l<Z<p 

^3(0. if P<1<1*2 

F 4 (Z), if I>Q. 



1 



2 7 + 2 



3 \a 2 p log(d). 

On [p, F 3 (Z) achieves its minimum at ZJ if p 3+27 log(d) > n > sp 3 \og(d) 
and at l* 3 if n > p 3+27 log(d). Note that F 3 (l() < F X (Z) for any Z G [l,p] and 
^3 ('2) < F i( l ) for an y ' > l 2- Then, for p 3+2 T log(d) > n > sp 3 log(d) the 
optimal value of Z minimizing (17) is 



Z, 



Cc 2 sp log(d) 



and for n > p 3+2 '>' log(cZ) the optimal value of Z is 

1 

h 



2 7 + 2 



a 2 p log(rf) / 

Minimax rate of convergence: For p = 1 the optimal choice of Z in (17) is 

1 



(2 7 +l)b 2 n^ 2 7 + 2 



cr 2 log(rf) 



2 7 + l 

With this choice of Z, the rate of convergence given by Corollary 2 is n 

Note that for / G W 7 (0, 1) we recover the minimax rate of convergence as given 

in e.g. [26]. 
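Appendix

A Proof of Theorem 1

This proof uses ideas developed in the proof of Theorem 3 in [16]. The main difference is that here we have no restriction on the sup-norm of $A_0$. This implies several modifications in the proof.

It follows from the definition of the estimator $\hat A$ that

$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - \langle X_i, \hat A\rangle)^2 + \lambda\|\hat A\|_* \le \frac{1}{n}\sum_{i=1}^{n}(Y_i - \langle X_i, A_0\rangle)^2 + \lambda\|A_0\|_*$$

which, due to (8), implies

$$\frac{1}{n}\sum_{i=1}^{n}\big(\langle X_i, A_0 - \hat A\rangle + W_i^T\rho^{(l)}(t_i) + \sigma\xi_i\big)^2 + \lambda\|\hat A\|_* \le \frac{1}{n}\sum_{i=1}^{n}\big(W_i^T\rho^{(l)}(t_i) + \sigma\xi_i\big)^2 + \lambda\|A_0\|_*. \qquad (18)$$

Set $H = A_0 - \hat A$ and $\Sigma = \frac{1}{n}\sum_{i=1}^{n}\big(W_i^T\rho^{(l)}(t_i) + \sigma\xi_i\big)X_i$. Then, we can write (18) in the following way

$$\frac{1}{n}\sum_{i=1}^{n}\langle X_i, H\rangle^2 + 2\langle\Sigma, H\rangle + \lambda\|\hat A\|_* \le \lambda\|A_0\|_*.$$

By duality between the nuclear and the operator norms, we obtain

$$\frac{1}{n}\sum_{i=1}^{n}\langle X_i, H\rangle^2 + \lambda\|\hat A\|_* \le 2\,\|\Sigma\|\,\|H\|_* + \lambda\|A_0\|_*. \qquad (19)$$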

Let $P_S$ denote the projector on the linear subspace $S$ and let $S^\perp$ be the orthogonal complement of $S$. Let $u_j(A)$ and $v_j(A)$ denote respectively the left and the right orthonormal singular vectors of $A$; $S_1(A)$ is the linear span of $\{u_j(A)\}$, $S_2(A)$ is the linear span of $\{v_j(A)\}$. For $A, B \in \mathbb{R}^{p\times l}$ we set $P_A^\perp(B) = P_{S_1^\perp(A)}\, B\, P_{S_2^\perp(A)}$ and $P_A(B) = B - P_A^\perp(B)$.
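The two projections just defined are straightforward to compute from a singular value decomposition. The sketch below (hypothetical helper name, random test matrices) also checks the two facts the proof relies on: $\mathrm{rank}(P_A(B)) \le 2\,\mathrm{rank}(A)$ and additivity of the trace norm on $A$ and $P_A^\perp(B)$.

```python
import numpy as np

def projections(A, B, tol=1e-10):
    """Return (P_A(B), P_A^perp(B)) for the singular subspaces of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol))                       # numerical rank of A
    U, V = U[:, :r], Vt[:r].T                      # bases of S_1(A) and S_2(A)
    P1_perp = np.eye(A.shape[0]) - U @ U.T         # projector on the complement of S_1(A)
    P2_perp = np.eye(A.shape[1]) - V @ V.T         # projector on the complement of S_2(A)
    B_perp = P1_perp @ B @ P2_perp
    return B - B_perp, B_perp

rng = np.random.default_rng(5)
p, l, r = 7, 6, 2
A = rng.normal(size=(p, r)) @ rng.normal(size=(r, l))   # a rank-2 matrix
B = rng.normal(size=(p, l))
PA_B, PA_perp_B = projections(A, B)

def nuc(M):
    return np.linalg.norm(M, 'nuc')                # trace norm

rank_PA_B = np.linalg.matrix_rank(PA_B, tol=1e-8)
additivity_gap = abs(nuc(A + PA_perp_B) - (nuc(A) + nuc(PA_perp_B)))
```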
By definition, for any matrix $B$, the singular vectors of $P^\perp_{A_0}(B)$ are orthogonal to the space spanned by the singular vectors of $A_0$. This implies that $\|A_0 + P^\perp_{A_0}(H)\|_* = \|A_0\|_* + \|P^\perp_{A_0}(H)\|_*$. Then we compute



A + H 

= \\A + Pi o (H) + P Ao (H)\l 

> \\A + Pi a (H)l-\\P An (H)\l 

= ||Ao|L + ||Pi (^)||,-||P Ao (H)|| jt . 

From (20) we obtain 

||Ao|L- A <\\P An (H)\l-\\Pi a (H)l. 

* 

From (19), using (21) and A > 3 ||S|| we obtain 

n 

-Y, (X, h) 2 < 2 ||S|| I|Pa (h)L + A ||Pa (h)\ 

71 * * 



<-A||P Ao (ir)l 



(20) 



(21) 



(22) 



Since $P_A(B) = P_{S_1^\perp(A)}\, B\, P_{S_2(A)} + P_{S_1(A)}\, B$ and $\mathrm{rank}(P_{S_i(A)} B) \le \mathrm{rank}(A)$ we derive that $\mathrm{rank}(P_A(B)) \le 2\,\mathrm{rank}(A)$. From (22) we compute



1 ™ *i 
-J2(X t ,H) 2 <-\V2R\\H\\ 



(23) 



where we set $R = \mathrm{rank}(A_0)$.

For $0 < r \le m = \min(p, l)$ we consider the following constraint set



C(r)= ||A|| 2 <1, P|| 



> 



/ 64 log(rf) I 
log (6/5) n 



M\l<V-r\\A\\ 2 \ (24) 



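where $\|A\|^2_{L_2(\Pi\otimes\mu)} = E(\langle X, A\rangle^2)$. Note that the condition $\|A\|_* \le \sqrt{r}\,\|A\|_2$ is satisfied if $\mathrm{rank}(A) \le r$.

The following lemma shows that for matrices $A \in \mathcal{C}(r)$ we have an approximate restricted isometry. Its proof is given in Appendix B.

Lemma 2. For all $A \in \mathcal{C}(r)$

$$\frac{1}{n}\sum_{i=1}^{n}\langle X_i, A\rangle^2 \ge \frac{1}{2}\|A\|^2_{L_2(\Pi\otimes\mu)} - \frac{44\,c_\phi^2\, l\, r}{\omega_{\min}}\,\big(E(\|\Sigma_R\|)\big)^2$$

with probability at least $1 - \frac{2}{d}$.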
We need the following auxiliary lemma, which is proved in Appendix D.

Lemma 3. If $\lambda \ge 3\,\|\Sigma\|$, then

$$\|P^\perp_{A_0}(H)\|_* \le 5\,\|P_{A_0}(H)\|_*.$$

Lemma 3 implies that

$$\|H\|_* \le 6\,\|P_{A_0}(H)\|_* \le \sqrt{72R}\,\|H\|_2. \qquad (25)$$

If $\|H\|^2_{L_2(\Pi\otimes\mu)} \ge c_\phi^2\,\|H\|_2^2\sqrt{\frac{64\log(d)\,l}{\log(6/5)\,n}}$, (25) implies that $\frac{H}{\|H\|_2} \in \mathcal{C}(72R)$ and we can apply Lemma 2. From Lemma 2 and (23) we obtain that with probability at least $1 - \frac{2}{d}$ one has

$$\frac{1}{2}\|H\|^2_{L_2(\Pi\otimes\mu)} \le \frac{5}{3}\lambda\sqrt{2R}\,\|H\|_2 + \frac{3168\,c_\phi^2\, l\, R}{\omega_{\min}}\,\|H\|_2^2\,\big(E(\|\Sigma_R\|)\big)^2. \qquad (26)$$

The following lemma, proved in Section E.2, gives a suitable bound on $E\|\Sigma_R\|$:

Lemma 4. Let $(\epsilon_i)_{i=1}^n$ be an i.i.d. Rademacher sequence. Suppose that Assumption 3 holds. Then,



V n 

where $d = p + l$ and $M = \mathrm{tr}(\Omega) \vee (l\,\omega_{\max})$.
Using Lemma 4, (10) and (26) we obtain

,2^ 10,./oo„ ff „ , Gc 2 /i?A/log(d) 



Wnun||^||^ < -f AV2~R||^|| 2 + —2 ^ . (27) 

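On the other hand, equation (19) and the triangle inequality imply that

$$\lambda\|\hat A\|_* \le 2\,\|\Sigma\|\,\|\hat A\|_* + 2\,\|\Sigma\|\,\|A_0\|_* + \lambda\|A_0\|_*$$

and $\lambda \ge 3\,\|\Sigma\|$ yields

$$\|\hat A\|_2 \le \|\hat A\|_* \le 5\,\|A_0\|_*. \qquad (28)$$

Putting (28) into (27) and using $\mathrm{rank}(A_0) \le s$ we compute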



n I 

which implies the statement (i) of Theorem 1 in the case when \\H H^m®/*) > 



2 64 log(d)/ 
c<£ ll#1l 2 ' 



log (6/5) n 
If $\|H\|^2_{L_2(\Pi\otimes\mu)} < c_\phi^2\,\|H\|_2^2\sqrt{\frac{64\log(d)\,l}{\log(6/5)\,n}}$, using (10), we get

$$\omega_{\min}\,\|H\|_2^2 \le \|H\|^2_{L_2(\Pi\otimes\mu)} < c_\phi^2\,\|H\|_2^2\sqrt{\frac{64\log(d)\,l}{\log(6/5)\,n}}. \qquad (29)$$



Then (28) implies 



3< Cc ± JM; Jlog(d)l 



This completes the proof of part (i) of Theorem 1. 

CAlsM log(d) 
If, in addition n > 2 g > from (27) we obtain 

min 

^\\H\\l<^\V^\\H\\ 2 + ^fi.\\H\\ 2 2 

and 



mm 

On the other hand, for n > n** (29) does not hold. This completes the proof of 
Theorem 1. 



B Proof of Lemma 2 



44c^r(E(||S K ||)) 2 



Set£ 

^min 

bad event is small 



B = < 3 A e C(r) such that 



We will show that the probability of the following 



n 

J2(Xi,A) 2 -\\A\\ 



> 2 H^ll ia(n®^) 



Note that $\mathcal{B}$ contains the complement of the event that we are interested in. In order to estimate the probability of $\mathcal{B}$ we use a standard peeling argument.

Let $\nu = c_\phi^2\sqrt{\frac{64\log(d)\,l}{\log(6/5)\,n}}$ and $\alpha = \frac{6}{5}$. For $k \in \mathbb{N}$ set

$$S_k = \big\{A \in \mathcal{C}(r) :\ \alpha^{k-1}\nu \le \|A\|^2_{L_2(\Pi\otimes\mu)} \le \alpha^{k}\nu\big\}.$$
If the event B holds for some matrix A € C(r), then A belongs to some Sk and 

1 



-J2 {Xi,A) 2 -\\A\\ 



(30) 
For each $T > \nu$ consider the following set of matrices

$$\mathcal{C}(r, T) = \big\{A \in \mathcal{C}(r) :\ \|A\|^2_{L_2(\Pi\otimes\mu)} \le T\big\}$$

and the following event



B k = \ 3AeC(r,a k v) 



1 " 

-J2(X l ,A) 2 -\\A\ 



z-2(n®/i) 



; = 1 



Note that $A \in S_k$ implies that $A \in \mathcal{C}(r, \alpha^k\nu)$. Then (30) implies that $\mathcal{B}_k$ holds and we obtain $\mathcal{B} \subset \cup_k \mathcal{B}_k$. Thus, it is enough to estimate the probability of the simpler event $\mathcal{B}_k$ and then to apply the union bound. Such an estimation is given by the following lemma. Its proof is given in Appendix C. Let



Zt = sup 

A£C(r,T) 



1 ™ 



Lemma 5. 



Z T >-T 



— (E HEflll) < exp -g— 

^min / \ fc 



where C3 



128 



/ C3 71 Q; 2 ^ \ 

Lemma 5 implies that P < exp — . Using the union bound 

V c 4> 1 J 

we obtain 



P (B) < S P (6 fe ) < S exp 
fc=i fc=i 



9Jr 9 

C3 n a f 



< £ exp 
fe=l 



(2 C3 n log(a) f 2 ) A; 



where we used e x > x. We finally compute for v = cp 



I 64 log(ri) I 
log (6/5) n 



exp 



P(B) < 



1 — exp 



2 C3 nlog(a) ^ 2 \ 

/ exp (- log(cQ) 

2 c 3 n log(a) ^ 2 \ 1 - exp (- log(d)) ' 
cTl 



This completes the proof of Lemma 2. 
C Proof of Lemma 5

Our approach is standard: first we show that $Z_T$ concentrates around its expectation and then we upper bound the expectation. By definition,

$$Z_T = \sup_{A \in \mathcal{C}(r,T)}\left|\frac{1}{n}\sum_{i=1}^{n}\langle X_i, A\rangle^2 - E(\langle X, A\rangle^2)\right|.$$

Note that

$$|\langle X_i, A\rangle| \le \|W_i\|_2\,\|\phi(t_i)\|_2\,\|A\|_2 \le c_\phi\sqrt{l},$$

where we used $\|W\|_2 \le 1$, condition (2) and $\|A\|_2 \le 1$ for $A \in \mathcal{C}(r)$.

Massart's concentration inequality (see e.g. [1, Theorem 14.2]) implies that 



(31) 



where $c_3 = \frac{1}{128}$.

Next we bound the expectation $E(Z_T)$. Using a standard symmetrization argument (see Ledoux and Talagrand [21]) we obtain



E [Z T ) = E sup 

\AeC(r,T) 

< 2E ( sup 



1 n 

-Y,{X l ,A) 2 -E[{X,A) 2 ) 

i=l 

1 ™ \ 

n U ) 



where is an i.i.d. Rademacher sequence. Then, the contraction inequal- 

ity (see Ledoux and Talagrand [21]) yields 



E{Z T ) < 8c \/Ie sup 

\AeC(r,T) 



n \ 
-^{X^A) 

n U J 



8c VZE sup \{T, R ,A)\ 

\AeC(r,T) 



where E_r = -Vf.I,. For AeC{r, T) we have that 

"i=i 

NL<VF|H| 2 

^ll^lli 2 (n^) 



< 



< 



where we have used (10). 
Then, by duality between nuclear and operator norms, we compute
E(Z T ) < 8c \flE ( sup \(Er,A)\ 



I 7 rj~i 

< 8c J E(||Efl|| 



Finally, using 



15„ „ [TrT„„,„ „, /l 8\ 5„ 44c?,Zr 



and the concentration bound (31) we obtain that 

P Z T >^-T+ — *— (E (||£r||)) < exp ' 



2 



12 w min / \ cil 



where c-j = as stated. 

d 128 



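D Proof of Lemma 3

Using (19) we compute

$$\lambda\big(\|\hat A\|_* - \|A_0\|_*\big) \le 2\,\|\Sigma\|\,\|H\|_*.$$

The condition $\lambda \ge 3\,\|\Sigma\|$, the triangle inequality and (21) yield

$$\lambda\big(\|P^\perp_{A_0}(H)\|_* - \|P_{A_0}(H)\|_*\big) \le \frac{2}{3}\lambda\big(\|P^\perp_{A_0}(H)\|_* + \|P_{A_0}(H)\|_*\big).$$

This implies that

$$\|P^\perp_{A_0}(H)\|_* \le 5\,\|P_{A_0}(H)\|_*,$$

as stated.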

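E Bounds on the stochastic errors

In this section we will obtain upper bounds for the stochastic errors $\|\Sigma\|$ and $\|\Sigma_R\|$. Recall that

$$\Sigma_R = \frac{1}{n}\sum_{i=1}^{n}\epsilon_i X_i \quad\text{and}\quad \Sigma = \frac{1}{n}\sum_{i=1}^{n}\big(W_i^T\rho^{(l)}(t_i) + \sigma\xi_i\big)X_i, \qquad (32)$$

where $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. Rademacher sequence.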
The following proposition is the matrix version of Bernstein's inequality in the bounded case (see Theorem 1.6 in [25]). Let $Z_1, \ldots, Z_n$ be independent random matrices with dimensions $m_1 \times m_2$. Define

$$\sigma_Z = \max\left\{\Big\|\frac{1}{n}\sum_{i=1}^{n}E(Z_iZ_i^T)\Big\|^{1/2},\ \Big\|\frac{1}{n}\sum_{i=1}^{n}E(Z_i^TZ_i)\Big\|^{1/2}\right\}.$$

Proposition 1. Let $Z_1, \ldots, Z_n$ be independent random matrices with dimensions $m_1 \times m_2$ that satisfy $E(Z_i) = 0$. Suppose that $\|Z_i\| \le U$ for some constant $U$ and all $i = 1, \ldots, n$. Then, for all $t > 0$, with probability at least $1 - e^{-t}$ we have

$$\Big\|\frac{1}{n}\sum_{i=1}^{n}Z_i\Big\| \le 2\max\left\{\sigma_Z\sqrt{\frac{t + \log(d)}{n}},\ U\,\frac{t + \log(d)}{n}\right\},$$

where $d = m_1 + m_2$.



It is possible to extend this result to the sub-exponential case. Set

$$U_i = \inf\{K > 0 :\ E\exp(\|Z_i\|/K) \le e\}.$$

The following proposition is obtained by an extension of Theorem 4 in [18] to rectangular matrices via self-adjoint dilation (cf., for example, 2.6 in [25]).
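The self-adjoint dilation mentioned above embeds an $m_1 \times m_2$ matrix $Z$ into a symmetric matrix of size $m_1 + m_2$ with the same operator norm, which is why the dimension factor in the propositions is $d = m_1 + m_2$. A minimal sketch, with a hypothetical helper name and a random test matrix:

```python
import numpy as np

def dilation(Z):
    """Self-adjoint dilation [[0, Z], [Z^T, 0]] of a rectangular matrix Z."""
    m1, m2 = Z.shape
    return np.block([[np.zeros((m1, m1)), Z],
                     [Z.T, np.zeros((m2, m2))]])

rng = np.random.default_rng(6)
Z = rng.normal(size=(3, 5))
D = dilation(Z)                               # symmetric, of size (m1 + m2) x (m1 + m2)

norm_Z = np.linalg.norm(Z, 2)                 # operator norm of Z
norm_D = np.linalg.norm(D, 2)                 # operator norm of the dilation
```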

Proposition 2. Let $Z_1, \ldots, Z_n$ be independent random matrices with dimensions $m_1 \times m_2$ that satisfy $E(Z_i) = 0$. Suppose that $U_i \le U$ for some constant $U$ and all $i = 1, \ldots, n$. Then, there exists an absolute constant $c^*$ such that, for all $t > 0$, with probability at least $1 - e^{-t}$ we have

$$\Big\|\frac{1}{n}\sum_{i=1}^{n}Z_i\Big\| \le c^*\max\left\{\sigma_Z\sqrt{\frac{t + \log(d)}{n}},\ U\log\Big(\frac{U}{\sigma_Z}\Big)\frac{t + \log(d)}{n}\right\},$$

where $d = m_1 + m_2$.

We use Propositions 1 and 2 to prove Lemmas 1 and 4.



E.1 Proof of Lemma 1

Let $\Sigma_1 = \frac{1}{n}\sum_{i=1}^{n}W_i^T\rho^{(l)}(t_i)\,X_i$ and $\Sigma_2 = \frac{1}{n}\sum_{i=1}^{n}\xi_i X_i$. Then, we obtain $\Sigma = \Sigma_1 + \sigma\Sigma_2$. In order to derive an upper bound for $\|\Sigma_2\|$, we apply Proposition 2 to

$$Z_i = \xi_i X_i = \xi_i W_i\phi^T(t_i).$$

We need to estimate $\sigma_Z$ and $U$. Note that $Z_i$ is a zero-mean random matrix such that

$$\|Z_i\| \le |\xi_i|\,\|W_i\phi^T(t_i)\|_2 = |\xi_i|\,\|W_i\|_2\,\|\phi(t_i)\|_2 \le |\xi_i|\,c_\phi\sqrt{l},$$

where we used condition (2) and $\|W\|_2 \le 1$. Then, Assumption 4 implies that there exists a constant $K$ such that $U_i \le K c_\phi\sqrt{l}$ for all $i = 1, \ldots, n$.

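Let us estimate $\sigma_Z$ for $Z = \xi W\phi^T(t)$. First we compute $\frac{1}{n}\sum_{i=1}^{n}E(Z_iZ_i^T)$:

$$\frac{1}{n}\sum_{i=1}^{n}E(Z_iZ_i^T) = E\big(\xi^2\,\|\phi(t)\|_2^2\,WW^T\big) = l\,\Omega, \qquad (33)$$

where we used $E(\xi^2) = 1$.

Now we compute $\frac{1}{n}\sum_{i=1}^{n}E(Z_i^TZ_i)$:

$$\frac{1}{n}\sum_{i=1}^{n}E(Z_i^TZ_i) = E\big(\xi^2\,\phi(t)\phi^T(t)\,\|W\|_2^2\big) = \mathrm{tr}(\Omega)\,I_l, \qquad (34)$$

where we used that $(\phi_i(\cdot))_{i=1,\ldots,\infty}$ is an orthonormal basis in $L_2((0, 1), d\mu)$.
Equations (33) and (34) imply that

$$\sigma_Z^2 \le (l\,\omega_{\max}) \vee \mathrm{tr}(\Omega) \quad\text{and}\quad \sigma_Z^2 \ge l\,\omega_{\max}.$$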

Applying Proposition 2 we derive that for all $t > 0$ with probability at least $1 - e^{-t}$



IE2II < c* max < 



M (* + log(d)) 



K C(j> Vl (t + log(d)) log 



^ (35) 



where $M = \mathrm{tr}(\Omega) \vee (l\,\omega_{\max})$.

One can estimate $\|\Sigma_1\|$ in a similar way. We apply Proposition 1 to

$$Z_i = W_i^T\rho^{(l)}(t_i)\,W_i\phi^T(t_i).$$

We begin by proving that

$$E\big(W^T\rho^{(l)}(t)\,W\phi^T(t)\big) = 0.$$
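Let $W = (w_1, \ldots, w_p)^T$. The $(m, k)$-th entry of $W^T\rho^{(l)}(t)\,W\phi^T(t)$ is equal to $\sum_{j=1}^{p} w_j\,\rho_j^{(l)}(t)\,w_m\,\phi_k(t)$. By definition $\rho_j^{(l)}(t) = f_j(t) - \sum_{i=1}^{l} a^0_{ji}\phi_i(t)$ and we compute

$$E\big(\rho_j^{(l)}(t)\phi_k(t)\big) = E\Big(\Big(f_j(t) - \sum_{i=1}^{l}a^0_{ji}\phi_i(t)\Big)\phi_k(t)\Big) = E\big(f_j(t)\phi_k(t)\big) - \sum_{i=1}^{l}a^0_{ji}\,E\big(\phi_i(t)\phi_k(t)\big) = a^0_{jk} - a^0_{jk} = 0$$

since $(\phi_i(\cdot))_{i=1,\ldots,\infty}$ is an orthonormal basis. Therefore,

$$E\Big(\sum_{j=1}^{p} w_j\,\rho_j^{(l)}(t)\,w_m\,\phi_k(t)\Big) = \sum_{j=1}^{p} E_W\Big(w_j w_m\,E_t\big(\rho_j^{(l)}(t)\phi_k(t)\big)\Big) = 0.$$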

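Next we estimate $U$. Note that $\rho^{(l)}(t)$ has at most $s - 1$ non-zero coefficients. Then, Assumption 1 and $\|W\|_2 \le 1$ imply that almost surely

$$\left(W^T \rho^{(l)}(t)\right)^2 \le \frac{b^2 (s - 1)}{l^{2\gamma}}$$

and

$$\|Z_i\| \le \left|W_i^T \rho^{(l)}(t_i)\right| \, \left\|W_i \phi^T(t_i)\right\| \le \frac{b \, c_\phi \sqrt{l (s - 1)}}{l^{\gamma}}.$$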

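Let us estimate $\sigma_Z$ for $Z = \left(W^T \rho^{(l)}(t)\right) W \phi^T(t)$. First we compute $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(Z_i Z_i^T)$:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(Z_i Z_i^T) = \mathbb{E}_t\left(\|\phi(t)\|_2^2 \; \mathbb{E}_W\left(\left(W^T \rho^{(l)}(t)\right)^2 W W^T\right)\right).$$

We obtain

$$\mathbb{E}_W\left(\left(W^T \rho^{(l)}(t)\right)^2 W W^T\right) \le \frac{b^2 (s - 1)}{l^{2\gamma}} \, \mathbb{E}\left(W W^T\right),$$

where we used $W W^T \ge 0$. Finally we obtain

$$\left\| \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(Z_i Z_i^T) \right\| \le \frac{b^2 (s - 1) \, \omega_{\max} \, l}{l^{2\gamma}}. \tag{36}$$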



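Now we compute $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(Z_i^T Z_i)$:

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(Z_i^T Z_i) = \mathbb{E}\left(\left(W^T \rho^{(l)}(t)\right)^2 \phi(t) W^T W \phi^T(t)\right) = \mathbb{E}\left(\left(W^T \rho^{(l)}(t)\right)^2 \|W\|_2^2 \, \phi(t) \phi^T(t)\right).$$

Using $\mathbb{E}\left(\|W\|_2^2\right) = \operatorname{tr}(\Omega)$ and $\mathbb{E}_t\left(\phi(t) \phi^T(t)\right) = I_l$ we obtain

$$\left\| \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}(Z_i^T Z_i) \right\| \le \frac{b^2 (s - 1)}{l^{2\gamma}} \operatorname{tr}(\Omega). \tag{37}$$

Equations (36) and (37) imply that

$$\sigma_Z^2 \le \frac{b^2 (s - 1)}{l^{2\gamma}} \left[\operatorname{tr}(\Omega) \vee (l \, \omega_{\max})\right].$$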



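Applying Proposition 1, we derive that for all $t > 0$, with probability at least $1 - e^{-t}$,

$$\|\Sigma_1\| \le \frac{2 b \sqrt{s - 1}}{l^{\gamma}} \max\left\{ \sqrt{\frac{M (t + \log(d))}{n}},\; \frac{c_\phi \sqrt{l} \, (t + \log(d))}{n} \right\}. \tag{38}$$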



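The bounds (38) and (35) imply that for all $t > 0$, with probability at least $1 - 2e^{-t}$,

$$\|\Sigma\| \le \left(\sigma c^* + \frac{2 b \sqrt{s - 1}}{l^{\gamma}}\right) \max\left\{ \sqrt{\frac{M (t + \log(d))}{n}},\; \frac{c_\phi \sqrt{l} \, (t + \log(d))}{n} \left[K \log\left(\frac{K c_\phi}{\sqrt{\omega_{\max}}}\right) \vee 1\right] \right\},$$

as stated.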



E.2 Proof of Lemma 4 

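The proof follows the lines of the proof of Lemma 7 in [17]. We use Proposition 1 with $Z_i = \epsilon_i X_i$. As in the proof of Lemma 1, we obtain $U = \sqrt{l}$ and $\sigma_Z^2 = \operatorname{tr}(\Omega) \vee (l \, \sigma_{\max}(\Omega))$. Set $M = \operatorname{tr}(\Omega) \vee (l \, \sigma_{\max}(\Omega))$; then Proposition 1 implies that for all $t > 0$, with probability at least $1 - e^{-t}$,

$$\|\Sigma_R\| \le 2 \max\left\{ \sqrt{\frac{M (t + \log(d))}{n}},\; \frac{\sqrt{l} \, (t + \log(d))}{n} \right\}. \tag{39}$$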

Set $t^* = \frac{nM}{l} - \log(d)$ so that $t^*$ is the value of $t$ such that the two terms in (39) are equal. Note that (39) implies that

$$\mathbb{P}\left(\|\Sigma_R\| > t\right) \le d \exp\left\{-t^2 \nu_1\right\} \quad \text{for } t \le t^* \tag{40}$$

and

$$\mathbb{P}\left(\|\Sigma_R\| > t\right) \le d \exp\left\{-t \, \nu_2\right\} \quad \text{for } t \ge t^*. \tag{41}$$

We set $\nu_1 = \frac{n}{4M}$, $\nu_2 = \frac{n}{2\sqrt{l}}$. By Hölder's inequality we derive

$$\mathbb{E}\|\Sigma_R\| \le \left(\mathbb{E}\|\Sigma_R\|^{2\log(d)}\right)^{1/(2\log(d))}.$$
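As a quick numerical sanity check of the crossover point, the sketch below evaluates the two terms inside the maximum in (39) at $t^* = nM/l - \log(d)$, the value obtained by equating them. The function names and the numerical values of $n$, $M$, $l$ and $d$ are hypothetical and chosen only for illustration.

```python
import math

def bernstein_terms(t, n, M, l, d):
    """The two terms inside the maximum in (39), without the leading factor 2."""
    x = t + math.log(d)
    return math.sqrt(M * x / n), math.sqrt(l) * x / n

def crossover(n, M, l, d):
    """Value t* at which the two terms coincide: t* = n * M / l - log(d)."""
    return n * M / l - math.log(d)

n, M, l, d = 500, 4.0, 8, 20          # arbitrary illustrative values
t_star = crossover(n, M, l, d)
sub_gaussian_term, sub_exponential_term = bernstein_terms(t_star, n, M, l, d)
```

Below $t^*$ the square-root term is the larger one, which corresponds to the sub-Gaussian tail (40); above $t^*$ the linear term takes over, which corresponds to the sub-exponential tail (41).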



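Inequalities (40) and (41) imply that

$$\begin{aligned}
\left(\mathbb{E}\|\Sigma_R\|^{2\log(d)}\right)^{1/(2\log(d))} &= \left( \int_0^{+\infty} \mathbb{P}\left(\|\Sigma_R\| > t^{1/(2\log(d))}\right) dt \right)^{1/(2\log(d))} \\
&\le \left( d \int_0^{+\infty} \exp\left\{-t^{1/\log(d)} \nu_1\right\} dt + d \int_0^{+\infty} \exp\left\{-t^{1/(2\log(d))} \nu_2\right\} dt \right)^{1/(2\log(d))} \\
&\le \sqrt{e} \left( \log(d) \, \nu_1^{-\log(d)} \, \Gamma(\log(d)) + 2\log(d) \, \nu_2^{-2\log(d)} \, \Gamma(2\log(d)) \right)^{1/(2\log(d))}.
\end{aligned} \tag{42}$$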
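The last step in (42) rests on the identity $\int_0^{+\infty} \exp\{-\nu t^{1/a}\} \, dt = a \, \nu^{-a} \, \Gamma(a)$, applied with $a = \log(d)$ and $a = 2\log(d)$, together with $d^{1/(2\log(d))} = \sqrt{e}$. The following sketch checks both facts numerically by integrating the original integrand directly with a midpoint rule. The function names, the truncation point and the values of $\nu$ and $a$ are illustrative assumptions.

```python
import math

def tail_integral(nu, a, t_max, steps):
    """Midpoint-rule approximation of int_0^{t_max} exp(-nu * t**(1/a)) dt."""
    h = t_max / steps
    return h * sum(math.exp(-nu * ((k + 0.5) * h) ** (1.0 / a)) for k in range(steps))

def closed_form(nu, a):
    """a * nu**(-a) * Gamma(a), the closed form used to pass to the last line of (42)."""
    return a * nu ** (-a) * math.gamma(a)

nu, a = 2.0, 2.0   # illustrative; a plays the role of log(d) or 2*log(d)
approx = tail_integral(nu, a, t_max=400.0, steps=400_000)  # tail beyond t_max is negligible
exact = closed_form(nu, a)                                  # 2 * (1/4) * Gamma(2) = 0.5

# The factor d**(1/(2*log(d))) equals sqrt(e) for any d > 1; d = 50 is arbitrary.
prefactor = 50.0 ** (1.0 / (2.0 * math.log(50.0)))
```

The identity itself follows from the substitution $u = \nu t^{1/a}$, which turns the integral into $a \, \nu^{-a} \int_0^{+\infty} u^{a-1} e^{-u} \, du$.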

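Recall that the Gamma function satisfies the following inequality:

$$\Gamma(x) \le \left(\frac{x}{2}\right)^{x-1} \quad \text{for } x \ge 2 \tag{43}$$

(see e.g. [17]). Plugging (43) into (42) we compute

$$\mathbb{E}\|\Sigma_R\| \le \sqrt{e} \left( (\log(d))^{\log(d)} \, \nu_1^{-\log(d)} \, 2^{1-\log(d)} + 2 (\log(d))^{2\log(d)} \, \nu_2^{-2\log(d)} \right)^{1/(2\log(d))}.$$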
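The Gamma-function bound used in this step can be checked numerically. The sketch below assumes that (43) has the form $\Gamma(x) \le (x/2)^{x-1}$ for $x \ge 2$, and that it is applied with $x = \log(d)$ and $x = 2\log(d)$ (so $\log(d) \ge 2$ is needed). The helper names and the choice $d = 100$ are hypothetical.

```python
import math

def gamma_bound_holds(x, tol=1e-12):
    """Check Gamma(x) <= (x/2)**(x-1), the form of (43) assumed here."""
    return math.gamma(x) <= (x / 2.0) ** (x - 1.0) * (1.0 + tol)

# Grid of x values in [2, 60]; equality holds at x = 2.
grid = [2.0 + 0.25 * k for k in range(233)]
all_ok = all(gamma_bound_holds(x) for x in grid)

def plugged_terms(d):
    """Pairs (exact value, bound after applying (43)) for the two terms of (42)."""
    a = math.log(d)
    first = (a * math.gamma(a), a ** a * 2.0 ** (1.0 - a))
    second = (2.0 * a * math.gamma(2.0 * a), 2.0 * a ** (2.0 * a))
    return first, second

first, second = plugged_terms(d=100)
```

The restriction $x \ge 2$ matters: at $x = 1.5$ one has $\Gamma(1.5) = \sqrt{\pi}/2 \approx 0.886$, which exceeds $(0.75)^{1/2} \approx 0.866$.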



Observe that $n > n^*$ implies $\nu_1 \log(d) \le \nu_2^2$ and we obtain

$$\mathbb{E}\|\Sigma_R\| \le \sqrt{\frac{2 e \log(d)}{\nu_1}}. \tag{44}$$

We conclude the proof by plugging $\nu_1 = \frac{n}{4M}$ into (44).

References 

[1] Bühlmann, P. and van de Geer, S. (2011) Statistics for High-Dimensional
Data: Methods, Theory and Applications. Springer.

[2] Candès, E.J. and Recht, B. (2009) Exact matrix completion via convex
optimization. Foundations of Computational Mathematics, 9(6), 717-772.

[3] Chiang, C.-T., Rice, J. A. and Wu, C. O. (2001). Smoothing spline
estimation for varying coefficient models with repeatedly measured dependent
variables. J. Amer. Statist. Assoc., 96, 605-619.






[4] Cleveland, W.S., Grosse, E. and Shyu, W.M. (1991) Local regression
models. Statistical Models in S (Chambers, J.M. and Hastie, T.J., eds), 309-376.
Wadsworth and Brooks, Pacific Grove.

[5] Fan, J. and Zhang, W. (1999). Statistical estimation in varying coefficient
models. Ann. Statist., 27, 1491-1518.

[6] Fan, J., and Zhang, W. (2008) Statistical methods with varying coefficient
models. Statistics and Its Interface, 1, 179-195.

[7] Hastie, T.J. and Tibshirani, R.J. (1993) Varying-coefficient models. J. Roy.
Statist. Soc. B, 55, 757-796.

[8] Hoover, D. R., Rice, J. A., Wu, C. O. and Yang, L.-P. (1998).
Nonparametric smoothing estimates of time-varying coefficient models with
longitudinal data. Biometrika, 85, 809-822.

[9] Huang, J. Z., Wu, C. O. and Zhou, L. (2002). Varying-coefficient models
and basis function approximations for the analysis of repeated
measurements. Biometrika, 89, 111-128.

[10] Huang, J. Z. and Shen, H. (2004). Functional coefficient regression
models for nonlinear time series: A polynomial spline approach. Scandinavian
Journal of Statistics, 31, 515-534.

[11] Huang, J. Z., Wu, C. O. and Zhou, L. (2004). Polynomial spline
estimation and inference for varying coefficient models with longitudinal data.
Statistica Sinica, 14, 763-788.

[12] Kauermann, G. and Tutz, G. (1999). On model diagnostics using varying
coefficient models. Biometrika, 86, 119-128.

[13] Kai, B., Li, R., and Zou, H. (2011) New efficient estimation and variable
selection methods for semiparametric varying-coefficient partially linear
models. Ann. Stat., 39, 305-332.

[14] Keshavan, R.H., Montanari, A. and Oh, S. (2010) Matrix completion from
a few entries. IEEE Trans. on Info. Th., 56(6), 2980-2998.

[15] Klopp, O. (2011) Matrix completion with unknown variance of the noise.
http://arxiv.org/abs/1112.3055

[16] Klopp, O. (2012) Noisy low-rank matrix completion with general sampling
distribution. Bernoulli, to appear.

[17] Klopp, O. (2011) Rank penalized estimators for high-dimensional matrices.
Electronic Journal of Statistics, 5, 1161-1183.

[18] Koltchinskii, V. (2011) A remark on low rank matrix recovery and
noncommutative Bernstein type inequalities. IMS Collections, Festschrift in
Honor of J. Wellner.






[19] Koltchinskii, V., Lounici, K. and Tsybakov, A. (2011) Nuclear norm
penalization and optimal rates for noisy low rank matrix completion. Annals
of Statistics, 39(5), 2302-2329.

[20] Lian, H. (2012) Spline Estimator for Simultaneous Variable Selection
and Constant Coefficient Identification in High-dimensional Generalized
Varying-Coefficient Models. Manuscript.

[21] Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry
and Processes. Springer-Verlag, New York, NY, 1991.

[22] Mallat, S. (2009) A Wavelet Tour of Signal Processing, Third Ed., Elsevier,
New York.

[23] Negahban, S. and Wainwright, M. J. (2010). Restricted strong convexity
and weighted matrix completion: Optimal bounds with noise. Journal of
Machine Learning Research, 13, 1665-1697.

[24] Senturk, D. and Mueller, H. G. (2010) Functional varying coefficient models
for longitudinal data. J. Amer. Statist. Assoc., 105, 1256-1264.

[25] Tropp, J. A. (2011) User-friendly tail bounds for sums of random matrices.
Found. Comput. Math., 11(4).

[26] Tsybakov, A. (2010) Introduction to Nonparametric Estimation, Springer
Series in Statistics.

[27] Wang, L., Kai, B., and Li, R. (2009) Local Rank Inference for Varying
Coefficient Models. J. Amer. Statist. Assoc., 104, 1631-1645.

[28] Wu, C. O., Chiang, C. T. and Hoover, D. R. (1998). Asymptotic
confidence regions for kernel smoothing of a varying-coefficient model with
longitudinal data. J. Amer. Statist. Assoc., 93, 1388-1402.

[29] Yang, L., Park, B.U., Xue, L. and Härdle, W. (2006) Estimation and
Testing for Varying Coefficients in Additive Models With Marginal Integration.
J. Amer. Statist. Assoc., 101, 1212-1227.






